---
title: 'Sentence Chunker'
description: 'Split text into chunks while preserving sentence boundaries'
icon: 'align-left'
---

The `SentenceChunker` splits text into chunks while preserving complete sentences, ensuring that each chunk maintains proper sentence boundaries and context.

## API Reference
To use the `SentenceChunker` via the API, check out the [API reference documentation](../../api-reference/sentence-chunker).

## Installation

SentenceChunker is included in the base installation of Chonkie. No additional dependencies are required.

<Info>For installation instructions, see the [Installation Guide](/getting-started/installation).</Info>

## Initialization

```python
from chonkie import SentenceChunker

# Basic initialization with default parameters
chunker = SentenceChunker(
    tokenizer_or_token_counter="character",  # Default tokenizer (or use "gpt2", etc.)
    chunk_size=2048,                  # Maximum tokens per chunk
    chunk_overlap=128,               # Overlap between chunks
    min_sentences_per_chunk=1        # Minimum sentences in each chunk
)
```

## Parameters

<ParamField
    path="tokenizer_or_token_counter"
    type="Union[str, Callable, Any]"
    default="character"
>
    Tokenizer to use. Can be a string identifier ("character", "word", "gpt2", etc.) or a tokenizer instance
</ParamField>

<ParamField
    path="chunk_size"
    type="int"
    default="2048"
>
    Maximum number of tokens per chunk
</ParamField>

<ParamField
    path="chunk_overlap"
    type="int"
    default="0"
>
    Number of overlapping tokens between chunks
</ParamField>

<ParamField
    path="min_sentences_per_chunk"
    type="int"
    default="1"
>
    Minimum number of sentences to include in each chunk
</ParamField>

<ParamField
    path="min_characters_per_sentence"
    type="int"
    default="12"
>
    Minimum number of characters per sentence
</ParamField>

<ParamField
    path="approximate"
    type="bool"
    default="False"
>
    Use approximate token counting for faster processing.
    **Note**: This field is deprecated and will be removed in future versions.
</ParamField>

<ParamField
    path="delim"
    type="Union[str, List[str]]"
    default="['.', '!', '?', '\n']"
>
    Delimiters to split sentences on
</ParamField>

<ParamField
    path="include_delim"
    type='Optional[Literal["prev", "next"]]'
    default="prev"
>
    Include delimiters in the chunk text. If so, specify whether to include the previous or next delimiter.
</ParamField>


## Usage

### Single Text Chunking

```python
text = """This is the first sentence. This is the second sentence. 
And here's a third one with some additional context."""
chunks = chunker.chunk(text)

for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
    print(f"Number of sentences: {len(chunk.sentences)}")
```

### Batch Chunking

```python
texts = [
    "First document. With multiple sentences.",
    "Second document. Also with sentences. And more context."
]
batch_chunks = chunker.chunk_batch(texts)

for doc_chunks in batch_chunks:
    for chunk in doc_chunks:
        print(f"Chunk: {chunk.text}")
```

### Using as a Callable

```python
# Single text
chunks = chunker("First sentence. Second sentence.")

# Multiple texts
batch_chunks = chunker(["Text 1. More text.", "Text 2. More."])
```

## Supported Tokenizers

SentenceChunker supports multiple tokenizer backends:

- **TikToken** (Recommended)
  ```python
  import tiktoken
  tokenizer = tiktoken.get_encoding("gpt2")
  ```

- **AutoTikTokenizer**
  ```python
  from autotiktokenizer import AutoTikTokenizer
  tokenizer = AutoTikTokenizer.from_pretrained("gpt2")
  ```

- **Hugging Face Tokenizers**
  ```python
  from tokenizers import Tokenizer
  tokenizer = Tokenizer.from_pretrained("gpt2")
  ```

- **Transformers**
  ```python
  from transformers import AutoTokenizer
  tokenizer = AutoTokenizer.from_pretrained("gpt2")
  ```

## Return Type

SentenceChunker returns chunks as `SentenceChunk` objects with additional sentence metadata:

```python
@dataclass
class Sentence:
    text: str           # The sentence text
    start_index: int    # Starting position in original text
    end_index: int      # Ending position in original text
    token_count: int    # Number of tokens in sentence

@dataclass
class SentenceChunk(Chunk):
    text: str           # The chunk text
    start_index: int    # Starting position in original text
    end_index: int      # Ending position in original text
    token_count: int    # Number of tokens in chunk
    sentences: List[Sentence]  # List of sentences in chunk
```